
Capstone Project: Group 2 NLP-1

DOMAIN: Automated Ticketing System

Participants:

Introduction

The AS-IS State

Currently,

Pain points

Opportunity

Powerful AI techniques that classify incidents to the right functional groups can help organizations reduce issue-resolution time and let support staff focus on more productive tasks.

PROJECT OBJECTIVE: Build a classifier that can classify the tickets by analyzing text.

1. Importing necessary libraries

2. EDA and Data Visualization

Fetching the given file

2.1 Exploring the given Data files

Data has four columns:

2.2. Understanding the structure of data

There are 8500 rows in the dataset

2.3 Finding missing points in data

We need to either remove or impute such entries before processing the data

We will analyze the data further and derive more insights later
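Counting missing values per column can be sketched with pandas (the dataframe below is a hypothetical miniature of the ticket dataset, with the column names used in this report):

```python
import pandas as pd

# Hypothetical miniature of the ticket dataset
df = pd.DataFrame({
    "Short description": ["login issue", None, "outlook error"],
    "Description": ["cannot log in", "vpn drops", None],
    "Caller": ["u1", "u2", "u3"],
    "Assignment group": ["GRP_0", "GRP_12", "GRP_0"],
})

missing = df.isna().sum()   # count of missing cells per column
print(missing)
```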

2.4 Finding inconsistencies in the data

2.5 Visualizing different patterns

Plotting the pie chart showing percentage distribution of tickets against different functional groups
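The percentage distribution feeding such a pie chart can be sketched as follows (the group counts are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")             # headless backend so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of assignment groups
groups = pd.Series(["GRP_0"] * 6 + ["GRP_8"] * 3 + ["GRP_12"] * 1)
pct = groups.value_counts(normalize=True) * 100   # percentage per group

plt.figure(figsize=(4, 4))
plt.pie(pct, labels=pct.index, autopct="%1.1f%%")
plt.title("Ticket distribution across functional groups")
```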

Observation:

Plotting the frequency distribution of tickets, in descending order, across the various assignment groups

Observation:

Plotting the frequency distribution once more without the outlier GRP_0

Finding groups which hardly receive any tickets

Observation:

Observation:

The observations from the histogram are confirmed by the pie chart, which shows the percentage distribution of assigned tickets

Checking whether any Short Description category accounts for the bulk of the tickets

Observation: As seen, no defined short-description category receives a significant share of tickets. Hence we need to apply NLP techniques for classification

3. Pre-processing of data

3.1. Dealing with missing values

3.2 Data Cleansing

We need to create a custom function for data cleansing, as the standard nltk stopwords alone would not work for:

Observation: Some of the sentences appear to be German and need to be translated to English. We first need to identify all such sentences
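A custom cleansing function of the kind described can be sketched as below. The stop-word list here is a small illustrative stand-in; the notebook presumably combines nltk's English stopwords with domain-specific junk tokens:

```python
import re

# Small illustrative stop-word list (a stand-in for nltk stopwords + junk tokens)
STOPWORDS = {"the", "a", "is", "to", "for", "please", "hello", "from"}

def clean_text(text: str) -> str:
    """Lower-case, strip e-mails/URLs/digits/punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)           # e-mail addresses
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z\s]", " ", text)          # digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("Hello, please reset password for user42@example.com!"))
```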

3.3 Language Detection & Translation

3.4 Translating both "Short Description" and "Description" columns using google translator if language is not English

3.5 Removing non-dictionary tokens like 'xd' from both the 'Short description' and 'Description' columns

Observation: One entire entry in "Short description" and 57 entries in "Description" are removed, which reduces the row count and causes an issue when updating the dataframe, which still has the original row count - shown in the snapshots below

To address this we should merge the two columns, which will also help with NLP model building later

3.6 Merging "Short description" and "Description" into one column i.e. "merged"
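The merge can be sketched in pandas, filling missing text with empty strings first so no row is lost:

```python
import pandas as pd

df = pd.DataFrame({
    "Short description": ["login issue", None],
    "Description": ["cannot log in to vpn", "outlook keeps crashing"],
})

# Fill missing text with empty strings, then concatenate the two columns
df["merged"] = (df["Short description"].fillna("") + " " +
                df["Description"].fillna("")).str.strip()
print(df["merged"].tolist())
```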

3.7 Recording the row indices for which the language is not English

Conclusion:

As shown above, there were around 852 rows whose language was not English. The detected non-English languages were: 'af', 'ca', 'cs', 'cy', 'da', 'de', 'es', 'et', 'fi', 'fr', 'hr', 'id', 'it', 'lt', 'lv', 'nl', 'no', 'pl', 'pt', 'ro', 'sk', 'sl', 'so', 'sq', 'sv' and 'tl'

4. Stemming, Tokenization and Lemmatization after stopwords removal

4.1 Applying Stemming and Tokenization
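Stemming after simple whitespace tokenization can be sketched with nltk's PorterStemmer (assuming nltk is installed; the example sentence is made up):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = "user unable to login after resetting passwords".split()  # simple tokenization
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```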

Conclusion: We can see that:

5. Transforming the text data into bigrams and trigrams for plotting "Word Clouds"

Topic Modeling

5.1 Converting sentences to words

5.2. Creating bigram and trigram models
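The notebook presumably builds these with gensim's Phrases models; the underlying idea can be sketched in pure Python, joining adjacent tokens with underscores:

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as underscore-joined strings."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["password", "reset", "not", "working"]
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```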

5.3. Creating WordClouds for each of the groups

5.4. Plotting overall wordcloud

5.5. Plotting wordcloud with top 100 words

5.6. Wordclouds for each assignment group

6. Vectorization of text data

6.1. Creating a column 'list_processed_text' to store all the processed texts in the form of lists

6.2. Creating word vectors from list of processed texts

6.3 Tokenizing the data

The encoded_docs matrix contains the tokenized form of the processed text and will be used later for model building
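The notebook presumably produces encoded_docs with a Keras Tokenizer; a minimal pure-Python sketch of the same integer encoding (sample documents made up):

```python
docs = ["password reset failed", "outlook password expired"]

vocab = {}   # word -> integer id (1-based; 0 is reserved for padding)
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab) + 1)

encoded_docs = [[vocab[w] for w in doc.split()] for doc in docs]
print(encoded_docs)
```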

7. NN Modeling for Classification

7.1 Padding the tokenized data to a maximum length of 4 for use in model building later
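Keras's pad_sequences is the usual tool here; the idea (shown as post-padding for simplicity, while Keras pads at the front by default) can be sketched as:

```python
def pad(seq, maxlen, value=0):
    """Truncate to maxlen or right-pad with `value` (post-padding sketch)."""
    return seq[:maxlen] + [value] * (maxlen - len(seq))

encoded_docs = [[4, 12], [7, 3, 9, 1, 5]]   # made-up token-id sequences
padded = [pad(s, 4) for s in encoded_docs]
print(padded)   # [[4, 12, 0, 0], [7, 3, 9, 1]]
```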

Creating word embeddings using the pre-trained file "glove.6B.100d.txt"

7.2 Creating a weight matrix for words in training docs
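Building the weight matrix can be sketched as below. The two 4-dimensional vectors stand in for the 100-dimensional vectors parsed from glove.6B.100d.txt; words missing from GloVe keep a zero row:

```python
import numpy as np

# Tiny stand-in for vectors parsed from glove.6B.100d.txt (here dim = 4)
glove = {"password": np.array([0.1, 0.2, 0.3, 0.4]),
         "reset":    np.array([0.5, 0.1, 0.0, 0.2])}
dim = 4
vocab = {"password": 1, "reset": 2, "frobnicate": 3}   # word -> tokenizer index

# Row i of the matrix is the embedding of the word with index i
embedding_matrix = np.zeros((len(vocab) + 1, dim))
for word, idx in vocab.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]
print(embedding_matrix.shape)
```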

7.3 Compiling the NLP model

7.4. Creating a target column by label encoding

Label encoding the target column

7.5 Fitting the model with data obtained through pre-trained glove embedding model

Conclusion: Accuracy of the model trained using NN was found around 50%

8. Applying Supervised Learning Classification techniques with normal count vectorized data

8.1. Splitting data into Train and test sets

We have successfully split the dataset into training and test sets in an 80:20 ratio
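The 80:20 split can be sketched with scikit-learn (features and labels below are synthetic stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)    # stand-in feature matrix
y = np.array([0] * 5 + [1] * 5)     # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
print(len(X_train), len(X_test))    # 8 2
```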

8.2 Multinomial Naïve Bayes Classifier -- Using Default Hyperparameters

8.3 KNeighborsClassifier -- Using Default Hyperparameters

run_classification(KNeighborsClassifier(), X_train, X_test, y_train, y_test)
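The `run_classification` helper itself is not shown in this extract; a plausible minimal definition (hypothetical, demonstrated on a tiny synthetic dataset) would be:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def run_classification(model, X_train, X_test, y_train, y_test):
    """Fit the given model and report its test accuracy (hypothetical helper)."""
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: accuracy = {acc:.3f}")
    return acc

# Synthetic count-like data for demonstration
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 6))
y = (X.sum(axis=1) > 12).astype(int)
acc = run_classification(MultinomialNB(), X[:32], X[32:], y[:32], y[32:])
```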

8.4. DecisionTreeClassifier -- Using Default Hyperparameters

8.5. Random Forest Classifier -- Using Default Hyperparameters

8.6. XG Boost Classifier -- Using Default Hyperparameters

9. Generating input datasets for modelling and classification

Generating several sets of training and testing datasets based on padding length for each text row entry

9.1. Padding the sequences to a length of 28 to accommodate the context of each row

9.2. Padding with the maximum row length in the dataset

Maximum length was found = 1261 words

9.3. Padding with the median row length in the dataset

9. "Hyperparameter Tuning" of the Classification techniques applied above -- with padded data with 28 words per row

9.1 Applying GridSearch CV technique for Decision Tree classifier -- with padded data with 28 words per row

Observation: Decision Tree classifier with Grid search CV technique, with padded data with 28 words, has given accuracy of 50.3%
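A grid search of this kind can be sketched with scikit-learn; the iris dataset and the parameter grid below are stand-ins, since the notebook's actual grid is not shown:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in for the padded ticket features
param_grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 5]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```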

9.2 Applying GridSearch CV technique for Decision Tree classifier -- with padded data with 1261 words per row

Observation: Decision Tree classifier with Grid search CV technique, with padded data with 1261 words, has given accuracy of 51.4%

9.3 Applying GridSearch CV technique for Random Forest classifier with padded data -- with padded data with 28 words per row

Observation: Random Forest Classifier with Grid search CV technique, with padded data with 28 words, has given accuracy of 57%

9.4 Applying Random Search CV technique for Decision Tree Classifier -- with padded data with 28 words per row

Observation: Decision Tree Classifier with random search CV technique, with padded data with 28 words, has given accuracy of 51%

9.5 Applying Random Search CV technique for Random Forest Classifier -- with padded data with 28 words per row

Observation: RF Classifier with random search CV technique, with padded data with 28 words has given accuracy of 56%

10. LSTM Modeling for Classification

10.1. Creating embedding layer

10.3. Using the padded sequence with max length = 28 words per row

10.4. Creating target variable

10.5. Splitting into train and test data

10.6. Building the LSTM model

10.7. Evaluating the model

Observation: Found an accuracy of 57% with the LSTM classifier, which is higher than the NN model built with glove vectorized data

11 Supervised Learning Classification with glove vectorized data from the pre-trained model

11.1 Multinomial Naïve Bayes Classifier -- Using Default Hyperparameters

Splitting into train and test data

Observation: Found an accuracy of -- % with Multinomial NB classifier (with default hyperparameters) with glove vectorized data

11.2 KNeighborsClassifier -- Using Default Hyperparameters

Observation: Found an accuracy of -- % with K Neighbors classifier (with default hyperparameters) with glove vectorized data

11.3 DecisionTreeClassifier -- Using Default Hyperparameters

Observation: Found an accuracy of -- % with Decision Tree classifier (with default hyperparameters) with glove vectorized data

11.4 Random Forest Classifier -- Using Default Hyperparameters

Observation: Found an accuracy of -- % with RF classifier (with default hyperparameters) with glove vectorized data

11.5 XG Boost Classifier -- Using Default Hyperparameters

Observation: Found an accuracy of -- % with XG Boost classifier (with default hyperparameters) with glove vectorized data

12. Hyperparameter tuning of Classification techniques and using normal count vectorized data without padding

12.1. Creating count vectorized data without padding

12.2: Applying the tuning for Multinomial NB

Observation: Found an accuracy of 67 % with Multinomial NB with tuned hyperparameters with count vectorized data

12.3: Applying the tuning for KNN

Observation: Found an accuracy of 66 % with KNN with tuned hyperparameters with count vectorized data

12.4: Applying the tuning for SVM technique

12.5: Applying the tuning for Decision Tree Classifier technique

12.6: Applying the tuning for Random Forest Classifier technique

13. Applying conditional step-by-step classifiers to handle data imbalance

13.1. Creating a separate column in which GRP_0 keeps its own label and all the other groups are labelled "Others"
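The relabelling can be sketched in pandas (`group_binary` is a hypothetical column name; sample groups made up):

```python
import pandas as pd

df = pd.DataFrame({"Assignment group": ["GRP_0", "GRP_8", "GRP_0", "GRP_24"]})

# Keep GRP_0 as-is; collapse every other group into "Others"
df["group_binary"] = df["Assignment group"].where(
    df["Assignment group"] == "GRP_0", "Others")
print(df["group_binary"].tolist())   # ['GRP_0', 'Others', 'GRP_0', 'Others']
```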

13.2. Splitting the dataset into train and test sets for final evaluation of the conditional step-by-step classifier

13.3. Splitting the dataset into train and test sets for further processing through conditional step-by-step classifier

13.4. Build classifier 1 for GRP_0 vs 'others' classification

13.4.1. Splitting the train and test sets for classifier-1 for GRP_0 vs 'others' classification - Using overall train set created above for step classification
13.4.2. Creating the KNN classifier
13.4.3. Creating the XG Boost classifier
Observation: We will use the XG Boost classifier as classifier 1, as it gives an accuracy of 84%

13.5. Build classifier with only 'Others' group data

13.5.1. Build dataframe with only 'Others' group data
13.5.2. Resampling the data
13.5.3. Checking the distribution after resampling

We can see above that all the groups are distributed uniformly now
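Upsampling every group to a common size can be sketched with scikit-learn's resample utility (tiny made-up data):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"text": [f"t{i}" for i in range(9)],
                   "group": ["GRP_8"] * 6 + ["GRP_24"] * 3})

target = df["group"].value_counts().max()   # upsample every group to this size
balanced = pd.concat(
    resample(g, replace=True, n_samples=target, random_state=42)
    for _, g in df.groupby("group"))
print(balanced["group"].value_counts().to_dict())
```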

13.5.4. Vectorize the X data
13.5.5. Splitting the train and test sets for classifier-2 for rest of the groups classification
13.5.6. Build classifiers for all the groups except GRP_0 filtered out before
13.5.6.1. Applying the tuned hyperparameters as obtained in section 12.3 for KNN classifier
Observation: We will use KNN as classifier 2, as it gives the maximum accuracy of 93.2% with the resampled data

13.6. Defining a function to apply classifier 2 conditionally, based on the result of classifier 1
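The routing logic can be sketched as below. The stub classifier and the function name `two_stage_predict` are hypothetical; any fitted scikit-learn estimators with a `predict` method would slot in:

```python
import numpy as np

class _Stub:
    """Tiny stand-in classifier returning fixed labels (illustration only)."""
    def __init__(self, labels):
        self.labels = np.array(labels, dtype=object)
    def predict(self, X):
        return self.labels[:len(X)]

def two_stage_predict(clf1, clf2, X):
    """Route each row: clf1 decides GRP_0 vs Others; clf2 refines 'Others'."""
    pred = clf1.predict(X).astype(object)
    mask = pred == "Others"
    if mask.any():
        pred[mask] = clf2.predict(X[mask])
    return pred

X = np.zeros((3, 2))
clf1 = _Stub(["GRP_0", "Others", "Others"])
clf2 = _Stub(["GRP_8", "GRP_24"])
print(two_stage_predict(clf1, clf2, X))   # ['GRP_0' 'GRP_8' 'GRP_24']
```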

13.7. Evaluating the overall condition-based classifier with the test data formed initially

Conclusion: We found the overall accuracy of the condition-based classifier to be 77%

14. Comparing the results obtained from all the techniques

14.1 Forming X and Y axes

14.2 Plotting the accuracies

Final Conclusion

Comparison Results and classifier with the best result

The maximum accuracy, 66%, was achieved by the K Neighbors classifier after applying hyperparameter tuning on tokenized ticket-description data (without padding). Below are the top 3 classifiers along with their methods:

K Neighbors with hp tuning (66%) > Multinomial NB with hp tuning (65%) > K Neighbors, XG Boost and SVM (without hp tuning)

Insights

Data Quality and Recommendations

To address these shortcomings, we had to define a custom function to selectively remove these junk values from the dataset

Data Veracity

The given data looked genuine, with "Caller IDs" provided for each ticket. In industry practice, if malicious tickets are a possibility, we can build a solution that first verifies the registered Caller IDs and then processes the concerned tickets

Implementation Suggestions

The proposed classifier should be kept up to date by periodically monitoring its performance/accuracy and refitting the models with updated data accordingly.